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Abstract 

Multivariate categorical data occur in many ap¬ 
plications of machine learning. One of the main 
difficulties with these vectors of categorical vari¬ 
ables is sparsity. The number of possible obser¬ 
vations grows exponentially with vector length, 
but dataset diversity might be poor in compari¬ 
son. Recent models have gained significant im¬ 
provement in supervised tasks with this data. 
These models embed observations in a continu¬ 
ous space to capture similarities between them. 
Building on these ideas we propose a Bayesian 
model for the unsupervised task of distribution 
estimation of multivariate categorical data. 

We model vectors of categorical variables as 
generated from a non-linear transformation of 
a continuous latent space. Non-linearity cap¬ 
tures multi-modality in the distribution. The con¬ 
tinuous representation addresses sparsity. Our 
model ties together many existing models, link¬ 
ing the linear categorical latent Gaussian model, 
the Gaussian process latent variable model, and 
Gaussian process classification. We derive in¬ 
ference for our model based on recent develop¬ 
ments in sampling based variational inference. 
We show empirically that the model outperforms 
its linear and discrete counterparts in imputation 
tasks of sparse data. 

1 Introduction 


Multivariate categorical data is common in fields ranging 
from language processing to medical diagnosis. Recently 
proposed supervised models have gained significant im¬ 
provement in tasks involving big labelled data of this form 
(see for example Bengio et al. (2006)); Collobert & Weston 


( 2008) 1). These models rely on information that had been 
largely ignored before: similarity between vectors of cate¬ 
gorical variables. But what should we do in the unsuper¬ 
vised setting, when we face small unlabelled data of this 
form? 


Medical diagnosis provides good examples of small unla¬ 
belled data. Consider a dataset composed of test results of 


yg279@CAM.ac.uk 

yc373@CAM.ac.uk 

zoubin@cam.ac.uk 


a relatively small number of patients. Each patient has a 
medical record composed often of dozens of examinations, 
taking various categorical test results. We might be faced 
with the task of deciding which tests are necessary for a 
patient under examination to take, and which examination 
results could be deduced from the existing tests. This can 
be achieved with distribution estimation. 

Several tools in the Bayesian framework could be used 
for this task of distribution estimation of unlabelled small 
datasets. Tools such as the Dirichlet-Multinomial distribu¬ 
tion and its extensions are an example of such. These rely 
on relative frequencies of categorical variables appearing 
with others, with the addition of various smoothing tech¬ 
niques. But when faced with long multivariate sequences, 
these models run into problems of sparsity. This occurs 
when the data consists of vectors of categorical variables 
with most configurations of categorical values not in the 
dataset. In medical diagnosis this happens when there is a 
large number of possible examinations compared to a small 
number of patients. 

Building on ideas used for big labelled discrete data, we 
propose a Bayesian model for distribution estimation of 
small unlabelled data. Existing supervised models for dis¬ 
crete labelled data embed the observations in a continuous 
space. This is used to find the similarity between vectors of 
categorical variables. We extend this idea to the small unla¬ 
belled domain by modelling the continuous embedding as 
a latent variable. A generative model is used to find a dis¬ 
tribution over the discrete observations by modelling them 
as dependent on the continuous latent variables. 

Following the medical diagnosis example, patient n would 
be modelled by a continuous latent variable x n £ X. For 
each examination d, the latent x n induces a vector of prob¬ 
abilities f = (fndi, /ndA')- one probability foreach pos¬ 
sible test result k. A categorical distribution returns test re¬ 
sult y n d based on these probabilities, resulting in a patient’s 
medical assessment y n = (y n i, ■■■,y n D)- We need to de¬ 
cide how to model the distribution over the latent space X 
and vectors of probabilities f. 

We would like to capture sparse multi-modal categorical 
distributions. A possible approach would be to model the 
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continuous representation with a simple latent space and a 
non-linear transformation of points in the space to obtain 
probabilities. In this approach we place a standard normal 
distribution prior on a latent space, and feed the output of a 
non-linear transformation of the latent space into a Softmax 
to obtain probabilities. We use sparse Gaussian processes 
(GPs) to transform the latent space non-linearly. Sparse 
GPs form a distribution over functions supported on a small 
number of points with linear time complexity (Quinonero- 
|Candela & Rasmussen| |2005| |Titsias| |2Q09[ i. We use a co- 
variance function that is able to transform the latent space 
non-linearly. We name this model the Categorical Latent 
Gaussian Process (CLGP). Using a Gaussian process with 
a linear covariance function recovers the linear Gaussian 
model (LGM, Marlin et al. 2011) >. This model linearly 
transform a continuous latent space resulting in discrete ob¬ 
servations. 

The Softmax likelihood is not conjugate to the our Gaus¬ 
sian prior, and integrating the latent variables with a Soft- 
max distribution is intractable. A similar problem exists 
with LGMs. Marlin et al.[( j201 lj i solved this by using vari¬ 
ational inference and various bounds for the likelihood in 
the binary case, or alternative likelihoods to the Softmax in 
the categorical case (Khan et al. 2012| . Many bounds have 
been studied in the literature for the binary case: Jaakkola 
and Jordan’s bound (| Jaakkola & Jordan] | 1997) >, the tilted 
bound ( |Knowles & Minkaj |201 1[>, piecewise linear and 
quadratic bounds (Marlin et al. 20111, and others. But for 
categorical data fewer bounds exist, since the multivariate 
Softmax is hard to approximate in high-dimensions. The 
Bohning bound (Bohning 1992| and Blei and Lafferty’s 
bound dBlei & Lafferty 2006) give poor approximation 
( |Khan et al.||2012) . 


Instead we use recent developments in sampling-based 
variational inference (Blei et al., 2012| ) to avoid integrat¬ 


ing the latent variables with the Softmax analytically. Our 
approach takes advantage of this tool to obtain simple yet 
powerful model and inference. We use Monte Carlo in¬ 
tegration to approximate the non-conjugate likelihood ob- 


taining noisy gradients (Kingma & Welling 2013 Rezende 

et al. 2014; Titsias & Lazaro-Gredilla 

2014). We then use 

learning-rate free stochastic optimisatioi 

l (Tieleman & Hin- 


|ton| |2012| ) to optimise the noisy objective. We leverage 
symbolic differentiation (Theano, Bergstra et al.| 2010| ) to 
obtain simple and modular codJ] 

We experimentally show the advantages of using non-linear 
transformations for the latent space. We follow the ideas 
brought in Paccanaro & Hinton ( 2001 ) > and evaluate the 
models on relation embedding and relation learning. We 


'Python code for the model and inference is given in the ap¬ 
pendix, and available at http : //github. com/yaringal/ 
CLGP 


then demonstrate the utility of the model in the real-world 
sparse data domain. We use a medical diagnosis dataset 
where data is scarce, comparing our model to discrete fre¬ 
quency based models. We use the estimated distribution 
for a task similar to the above, where we attempt to in¬ 
fer which test results can be deduced from the others. We 
compare the models on the task of imputation of raw data 
studying the effects of government terror warnings on po¬ 
litical attitudes. We then evaluate the continuous models on 
a binary Alphadigits dataset composed of binary images of 
handwritten digits and letters, where each class contains a 
small number of images. We inspect the latent space em¬ 
beddings and separation of the classes. Lastly, we evaluate 
the robustness of our inference, inspecting the Monte Carlo 
estimate variance over time. 


2 Related Work 


Our model (CLGP) relates to some key probabilistic mod¬ 
els (fig. [TJ. It can be seen as a non-linear version of the 
latent Gaussian model (LGM, |Khan et aT7| ( 12012) 1) as dis¬ 
cussed above. In the LGM we have a standard normal prior 
placed on a latent space, which is transformed linearly and 
fed into a Softmax likelihood function. The probability 
vector output is then used to sample a single categorical 
value for each categorical variable (e.g. medical test re¬ 
sults) in a list of categorical variables (e.g. medical assess¬ 
ment). These categorical variables correspond to elements 
in a multivariate categorical vector. The parameters of the 
linear transformation are optimised directly within an EM 
framework. Khan et ak| ( |2012 1 avoid the hard task of ap¬ 
proximating the Softmax likelihood by using an alternative 
function (product of sigmoids) which is approximated us¬ 
ing numerical techniques. Our approach avoids this cum¬ 
bersome inference. 



Figure 1. Relations between existing models and the model pro¬ 
posed in this paper (Categorical Latent Gaussian Process)', the 
model can be seen as a non-linear version of the latent Gaussian 
model (left to right, |Khan et al.| ( |2012[ )), it can be seen as a latent 
counterpart to the Gaussian process classification model (back 
to front, |Williams & Rasmussen] ( |2006| l), or alternatively as a dis¬ 
crete extension of the Gaussian process latent variable model (top 
to bottom, |Lawrence|{2005| l). 
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Our proposed model can also be seen as a latent counterpart 
to the Gaussian process classification model (Williams & 


Rasmussen 2006|>, in which a Softmax function is again 


used to discretise the continuous values. The continu¬ 
ous valued outputs are obtained from a Gaussian process, 
which non-linearly transforms the inputs to the classifica¬ 
tion problem. Compared to GP classification where the 
inputs are fully observed, in our case the inputs are la¬ 
tent. Lastly, our model can be seen as a discrete extension 
of the Gaussian process latent variable model (GPLVM, 
Lawrence, |2005j ). This model has been proposed recently 
as means of performing non-linear dimensionality reduc¬ 
tion (counterpart to the linear principal component analysis 
( Tipping & Bishop} 19991) and density estimation in con¬ 
tinuous space. 


3 A Latent Gaussian Process Model for Mul¬ 
tivariate Categorical Data 

We consider a generative model for a dataset Y with N 
observations (patients for example) and D categorical vari¬ 
ables (different possible examinations). The d -th categor¬ 
ical variable in the rc-th observation, y nd , is a categorical 
variable that can take an integer value from 0 to A',/. For 
ease of notation, we assume all the categorical variables 
have the same cardinality, i.e. Kd = K. \/d = 1,..., D. 

In our generative model, each categorical variable y n d fol¬ 
lows a categorical distribution with probability given by a 
Softmax with weights i nd = (0, f ndl , f ndK ). Each 
weight f n dk is the output of a nonlinear function of a Q 
dimensional latent variable x„ £ R^: J r dfc(x rl ). To com¬ 
plete the generative model, we assign an isotropic Gaussian 
distribution prior with standard deviation a 2 for the latent 
variables, and a Gaussian process prior for each of the non¬ 
linear functions. We also consider a set of M auxiliary 
variables which are often called inducing inputs. These 
inputs Z £ R M x lie in the latent space with their cor¬ 
responding outputs U £ ]^ MxDxK lying in the weight 
space (together with f ndk )■ The inducing points are used 
as “support” for the function. Evaluating the covariance 
function of the GP on these instead of the entire dataset 
allows us to perform approximate inference in 0(M 2 N) 
time complexity instead of 0(N 3 ) (where M is the num¬ 
ber of inducing points and N is the number of data points 
( IQuinonero-Candela & Rasmussen!|2005) l). 

The model is expressed as: 

x nq ~ -W(0, a 2 ) (1) 

T dk ~ GP(K d ) 

fndk = dkfX-n)-; tlmdk — dkfam) 

Vnd ~ Softmax(f nd ), 

for n £ [iV] (the set of naturals between 1 and N ), q £ [Q ], 
d £ [D], k £ [A'], m £ [M], and where the Softmax 


distribution is defined as, 

Softmax(?/ = fc; f) = Categorical 

lse(f) = log f 1 + ^2 exp(/ fc /) j , (2) 

\ k' — l ' 

for k = 0,..., K and with fo := 0. 

Following our medical diagnosis example from the in¬ 
troduction, each patient is modelled by latent x„; for 
each examination d, x„ has a sequence of weights 
(fndi, ■■■, fndx), one weight for each possible test result k, 
that follows a Gaussian process; Softmax returns test result 
y n d based on these weights, resulting in a patient’s medical 
assessment y„ = (y nl , ...,y nD )- 

Define i dk = (f ld k, ■ /jvdfe) and define u dk = 
(u ldk ,...,u M dk)- The joint distribution of ( i dk ,u d k ) 
with the latent nonlinear function, T d k , marginalized un¬ 
der the GP assumption, is a multi-variate Gaussian dis¬ 
tribution A/”(0, IQ([X, Z], [X, Z])). It is easy to verify 
that when we further marginalize the inducing outputs, 
we end up with a joint distribution of the form f^fc ~ 
Af(0 , Krf(X, X)), Vd, k. Therefore, the introduction of in¬ 
ducing outputs does not change the marginal likelihood of 
the data Y. These are used in the variational inference 
method in the next section and the inducing inputs Z are 
considered as variational parameters. 

We use the automatic relevance determination (ARD) ra¬ 
dial basis function (RBF) covariance function for our 
model. ARD RBF is able to select the dimensionality of the 
latent space automatically and transform it non-linearly. 


exp (fk) \ 
exp(lse(f ))) ’ 


4 Inference 


The marginal log-likelihood, also known as log-evidence, 
is intractable for our model due to the non-linearity of 
the covariance function of the GP and the Softmax like¬ 
lihood function. We first describe a lower bound of the 
log-evidence (ELBO) by applying Jensen’s inequality with 
a variational distribution of the latent variables following 
|Titsias & Lawrence] ( |2Q 10] >. 

Consider a variational approximation to the posterior dis¬ 
tribution of X, F and U factorized as follows: 

g(X,F,U)= g (X)g(U)p(F|X,U). (3) 


We can obtain the ELBO by applying Jensen’s inequality 

logp(Y) = log J p(X)p(U)p(F|X, U)p(Y|F)XFU 


> J 5(X)?(U)p(F|X,U) 

p(X)p(U)p(F|X, U)p(Y|F) VI?IT 
• l0g g( XMU)p(F|X,U) m 
= - KL(g(X)||p(X)) - KL(g(U)||p(U)) 
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N D 

EE 

n=1d =1 ' 


g(x„)g(U d )p(f nd |x n , U d ) 

■ logp(y n d\fnd)xJ n dVd 


:=£ 


where 


K 


p(fnd\X-n: U d ) — 11 ^ (fndk \ &- n d^dk j ^ nd ) 


with 


fc=i 


3-nd — -K- d MM^d,Mm 


bnd — k-d,nn ^jy^K d ,Mn- 


(4) 

(5) 

( 6 ) 


Notice however that the integration of logp(y„ d |f„ d ) in eq. 
[4] involves a nonlinear function (lse(f) from eq. [2ji and is 
still intractable. Consequently, we do not have an analytical 
form for the optimal variational distribution of q(U) unlike 
in Titsias & Lawrence|( |2010| >. Instead of applying a further 
approximation/lower bound on £, we want to obtain bet¬ 


ter accuracy by following a sampling-based approach (Blei 


etal. 2012 

Kingma & Welling 

2013[ Rezende et al.[ 2014 

Titsias & 

xizaro-Gredilla 2014) to compute the lower 


bound £ and its derivatives with the Monte Carlo method. 


Specifically, we draw samples of x n , U d and f nd from 
<?(x n ), g(U d ), and p(f nd |x„, U d ) respectively and esti¬ 
mate C with the sample average. Another advantage of us¬ 
ing the Monte Carlo method is that we are not constrained 
to a limited choice of covariance functions for the GP that 
is otherwise required for an analytical solution in stan¬ 


dard approaches to GPLVM for continuous data (Titsias & 
|Lawrencel|2010[|Hensman et al.| |201 3| >. 


We consider a mean field approximation for the latent 


points g(X) as in (Titsias & Lawrence 2010) and a joint 


Gaussian distribution with the following factorisation for 

<?( U): 


D K 


? ( u )=nn^i^.^). 


d— 1 k—1 


N Q 

<7( x ) = n n^(x m | m ni ,s 2 ni ) (7) 

n= 1 i= 1 

where the covariance matrix S d is shared for the same cat¬ 
egorical variable d (remember that I\ is the number of val¬ 
ues this categorical variable can take). The KL divergence 
in £ can be computed analytically with the given varia¬ 
tional distributions. The parameters we need to optimise 
oveij^] include the hyper-parameters for the GP 0,/, varia¬ 
tional parameters for the inducing points Z, fi dk , E d , and 
the mean and standard deviation of the latent points m n( , 
Sni‘ 


2 Note that the number of parameters to optimise over can be 
reduced by transforming the latent space non-linearly to a second 
low-dimensional latent space, which is then transformed linearly 
to the weight space containing points f n dk ■ 


4.1 Transforming the Random Variables 


In order to obtain a Monte Carlo estimate to the gradients of 


£ with low variance, a useful trick introduced in (Kingma 


& Welling 2013|) is to transform the random variables to 


be sampled so that the randomness does not depend on the 
parameters with which the gradients will be computed. We 
present the transformation of each variable to be sampled 
as follows: 


Transforming X. For the mean field approximation, the 
transformation for X is straightforward as 

Xni = m ni + SniZnii £ ni ~ N(0, 1 ) ( 8 ) 

Transforming u dk- The variational distribution of u,//, : is 
a joint Gaussian distribution. Denote the Cholesky decom¬ 
position of X d as L d Lj = £ d . We can rewrite u dk as 

u rffc = Vdk + L d£ ( dk , £ dk ~ -AA(0,1 M ) (9) 

We optimize the lower triangular matrix L d instead of E d . 


Transforming f„ d . Since the conditional distribution 
p(f nd |x„, U d ) in Eq.[5]is factorized over k we can define a 
new random variable for every f nd k- 

fndk = a nd u dk + V^d^k, ^'{dk ~ ^( 0 , X ) ( 10 ) 

Notice that the transformation of the variables does not 
change the distribution of the original variables and there¬ 
fore does not change the KL divergence in Eq. [5] 

4.2 Lower Bound with Transformed Variables 


Given the transformation we just defined, we can represent 

the lower bound as 

N Q 

£ = -EE KL(q(x ni )\\p(x ni )) 

n—1 i =1 
D K 

-EE KL(g(u dfc )||p(u dfe )) 

d— 1 k—1 

N D p 

+ EE^M 

n—1 d—1 L 


log Softmax y„ d 


f «d ( £ nL lJ d(£ { d ) )^n(£ { n ) ))) 


(ID 

where the expectation in the last line is with respect to the 
fixed distribution defined in Eqs. [8] [9] and [TO] Each expec¬ 
tation term that involves the Softmax likelihood, denoted 
as £ n s d , can be estimated using Monte Carlo integration as 

£Ud ^ 


1 

T 


T 

log Softmax 

»=l 



*-nd 


( e 


(/) 

nd,i'> 


U d (e 


(«)\ 
d,i /’ 




%) 


( 12 ) 

with Ej i i £,'{J , drawn from their corresponding dis¬ 
tributions. Since these distributions do not depend on the 
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parameters to be optimized, the derivatives of the objective 
function C are now straight-forward to compute with the 
same set of samples using the chain rule. 

4.3 Stochastic Gradient Descent 


We use gradient descent to find an optimal variational dis¬ 
tribution. Gradient descent with noisy gradients is guaran¬ 
teed to converge to a local optimum given decreasing learn¬ 
ing rate with some conditions, but is hard to work with in 
practice. Initial values set for the learning rate influence 
the rate at which the algorithm converges, and misspeci- 
fied values can cause it to diverge. For this reason new 
techniques have been proposed that handle noisy gradients 
well. Optimisers such as AdaGrad (Duchi et al. 201 1) , 
AdaDelta ( |Zeiler [ [20 1 2| > , and RMSPROP ( Tieleman & Hin-| 
ton, |2012] ) have been proposed, each handling the gradi¬ 
ents slightly differently, all averaging over past gradients. 
Schaul et al. (|2013 i have studied empirically the different 


techniques, comparing them to one another on a variety 
of unit tests. They found that RMSPROP works better on 
many test sets compared to other optimisers evaluated. We 
thus chose to use RMSPROP for our experiments. 

A major advantage of our inference is that it is extremely 
easy to implement and adapt. The straight-forward com¬ 
putation of derivatives through the expectation makes it 
possible to use symbolic differentiation. We use Theano 
(Ber gstra et al.[ 2010| l for the inference implementation, 
where the generative model is implemented as in Eqs. [8] 
[9]and[T0] and the optimisation objective, evaluated on sam¬ 
ples from the generative model, is given by Eq. 11 


5 Experimental Results 


We evaluate the categorical latent Gaussian process 
(CLGP), comparing it to existing models for multivariate 
categorical distribution estimation. These include mod¬ 
els based on a discrete latent representation (such as the 
Dirichlet-Multinomial), and continuous latent representa¬ 
tion with a linear transformation of the latent space (latent 
Gaussian model (LGM, |Khan et ak| |2012| >). We demon¬ 
strate over-fitting problems with the LGM, and inspect the 
robustness of our inference. 


For the following experiments, both the linear and non¬ 
linear models were initialised with a 2D latent space. The 
mean values of the latent points, m n , were initialised at ran¬ 
dom following a standard normal distribution. The mean 
values of the inducing outputs (pt dk ) were initialised with 
a normal distribution with standard deviation 10 -2 . This 
is equivalent to using a uniform initial distribution for all 
values. We initialise the standard deviation of each latent 
point (s n ) to 0.1, and initialise the length-scales of the ARD 
RBF covariance function to 0.1. We then optimise the vari¬ 
ational distribution for 500 iterations. At every iteration we 
optimise the various quantities while holding u,//Ts varia¬ 


tional parameters fixed, and then optimise u^.’s variational 
parameters holding the other quantities fixed. 

Our setting supports semi-supervised learning with par¬ 
tially observed data. The latents of partially observed 
points are optimised with the training set, and then used 
to predict the missing values. We assess the performance 
of the models using test-set perplexity (a measure of how 
much the model is surprised by the missing test values). 
This is defined as the exponent of the negative average log 
predicted probability. 

5.1 Linear Models Have Difficulty with Multi-modal 
Distributions 


We show the advantages of using a non-linear categorical 
distribution estimation compared to a linear one, evaluating 
the CLGP against the linear LGM. We implement the latent 
Gaussian model using a linear covariance function in our 
model; we remove the KL divergence term in u following 
the model specification in (Khan et al., |2012[ ), and use our 
inference scheme described above. Empirically, the Monte 
Carlo inference scheme with the linear model results in the 
same test error on (Inoguchij 2008| as the piece-wise bound 
based inference scheme developed in (Khan et al. 20121. 


5.1.1 Relation Embedding - Exclusive Or 


A simple example of relational learning (following Pac- 
|canaro & Hinton] ( |2001[ )) can be used to demonstrate 
when linear latent space models fail. In this task we are 
given a dataset with example relations and the model is 


Prob of 1 for d=0 Prob of 1 for d= 1 Prob of 1 for d=2 



(b) CLGP 

Figure 2. Density over the latent space for XOR as predicted 
by the linear model (top, LGM), and non-linear model (bot¬ 
tom, CLGP). Each figure shows the probability p(yd = l|x) 
as a function of x, for d = 0,1, 2 (first, second, and third dig¬ 
its in the triplet left to right). The darker the shade of green, the 
higher the probability. In blue are the latents corresponding to the 
training points, in yellow are the latents corresponding to the four 
partially observed test points, and in red are the inducing points 
used to support the function. 
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(a) LGM 


(b) CLGP (c) Ground truth 


Figure 3. Histogram of categorical values for XOR (encoded 
in binary for the 8 possible values) for samples drawn from the 
posterior of the latent space of the linear model (left, LGM), the 
non-linear model (middle, CLGP), and the data used for training 
(right). 


to capture the distribution that generated them. A non¬ 
linear dataset is constructed using the XOR (exclusive 
or) relation. We collect 25 positive examples of each 
assignment of the binary relation (triplets of the form 
(0,0,0), (0,1,1), (1,0,1), (1,1,0), corresponding to 
0 XOR 1 = 1 and so on). We then maximise the vari¬ 
ational lower bound using RMSPROP for both the lin¬ 
ear and non-linear models with 20 samples for the Monte 
Carlo integration. We add four more triplets to the dataset: 
(0,0,?), (0,1,?), (1,0,?), (1,1,?). We evaluate the 
probabilities the models assign to each of the missing val¬ 
ues (also known as imputation) and report the results. 

We assessed the test-set perplexity repeating the experi¬ 
ment 3 times and averaging the results. The linear model 
obtained a test-set perplexity (with standard deviation) of 
75.785 ± 13.221, whereas the non-linear model obtained 
a test-set perplexity of 1.027 ± 0.016. A perplexity of 1 
corresponds to always predicting correctly. 

During optimisation the linear model jumps between differ¬ 
ent local modes, trying to capture all four possible triplets 
(fig# The linear model assigns probabilities to the miss¬ 
ing values by capturing some of the triplets well, but cannot 
assign high probability to the others. In contrast, the CLGP 
model is able to capture all possible values of the relation. 
Sampling from probability vectors from the latent varia¬ 
tional posterior for both models, we get a histogram of the 
posterior distribution (fig. [3]!. As can be seen, the CLGP 
model is able to fully capture the distribution whereas the 
linear model is incapable of it. 

5.1.2 Relation Learning - Complex Example 

We repeat the previous experiment with a more com¬ 
plex relation. We generated 1000 strings from a 
simple probabilistic context free grammar with non¬ 
terminals {a, /3, A, B, C}, start symbol a, terminals 
{a, b, c, d, e, /, g }, and derivation rules: 

a -» A/3 [1.0] 

/3 -¥ BA [0.5] | C [0.5] 

A -> a [0.5] | b [0.3] | c [0.2] 


B ^ d [0.7] | e [0.3] 

C^f [0.7] | g [0.3] 

where we give the probability of following a derivation rule 
inside square brackets. Following the start symbol and the 
derivation rules we generate strings of the form ada, af, 
ceb, and so on. We add two “start” and “end” symbols 
s,s to each string obtaining strings of the form sscebss. 
We then extract triplets of consecutive letters from these 
strings resulting in training examples of the form ssc, see , 
..., bss. This is common practice in text preprocessing in 
the natural language community. It allows us to generate 
relations from scratch by conditioning on the prefix ss to 
generate the first non-terminal (say a), then condition on 
the prefix sa, and continue in this vein iteratively. Note 
that a simple histogram based approach will do well on this 
dataset (and the previous dataset) as it is not sparse. 


Split 

LGM 

CLGP 

1 

7.898 ± 3.220 

2.769 ±0.070 

2 

26.477 ± 23.457 

2.958 ±0.265 

3 

6.748 ± 3.028 

2.797 ±0.081 


Table 1. Test-set perplexity on relational data. Compared are 
the linear LGM and the non-linear CLGP. 


We compared the test-set perplexity of the non-linear 
CLGP (with 50 inducing points) to that of the linear LGM 
on these training inputs, repeating the same experiment set¬ 
up as before. We impute one missing value at random in 
each test string, using 20% of the strings as a test set with 
3 different splits. The results are presented in table [T] The 
linear model cannot capture this data well, and seems to get 
very confident about misclassified values (resulting in very 
high test-set perplexity). 

5.2 Sparse Small Data 

We assess the model in the real-world domain of small 
sparse data. We compare the CLGP model to a his¬ 
togram based approach, demonstrating the difficulty with 
frequency models for sparse data. We further compare our 
model to the linear LGM. 


5.2.1 Medical Diagnosis 


We use the Wisconsin breast cancer dataset for which ob¬ 
taining samples is a long and expensive process (Zwitter & 
Soklicj 19881. The dataset is composed of 683 data points, 
with 9 categorical variables taking values between 1 and 
10, and an additional categorical variable taking 0,1 values 
- 2 x 10 9 possible configurations. We use three quarters 
of the dataset for training and leave the rest for testing, av¬ 
eraging the test-set perplexity on three repetitions of the 
experiment. We use three different random splits of the 
dataset as the distribution of the data can be fairly different 
among different splits. In the test set we randomly remove 
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Split 

Baseline 

Multinomial 

Uni-Dir-Mult 

Bi-Dir-Mult 

LGM 

CLGP 

1 

8.68 

4.41 

4.41 

3.41 

3.57 ±0.208 

2.86 ± 0.119 

2 

8.68 

OO 

4.42 

3.49 

3.47 ±0.252 

3.36 ± 0.186 

3 

8.85 

4.64 

4.64 

3.67 

12.13 ±9.705 

3.34 ±0.096 


Table 2. Test-set perplexity on Breast cancer dataset, predicting randomly missing categorical test results. The models compared 
are: Baseline predicting uniform probability for all values, Multinomial - predicting the probability for a missing value based on its 
frequency, Uni-Dir-Mult - Unigram Dirichlet Multinomial with concentration parameter a = 0.01, Bi-Dir-Mult - Bigram Dirichlet 
Multinomial with concentration parameter a = 1, LGM , and the proposed model ( CLGP ). 


one of the 10 categorical values, and test the models’ abil¬ 
ity to recover that value. Note that this is a harder task than 
the usual use of this dataset for binary classification. We 
use the same model set-up as in the first experiment. 

We compare our model {CLGP) to a baseline predicting 
uniform probability for all values (Baseline), a multino¬ 
mial model predicting the probability for a missing value 
based on its frequency (Multinomial), a unigram Dirichlet 
Multinomial model with concentration parameter a = 0.01 
( Uni-DirMult ), a bigram Dirichlet Multinomial with con¬ 
centration parameter a = 1 ( Bi-Dir-Mult ), and the lin¬ 
ear continuous space model (LGM). More complicated fre¬ 
quency based approaches are possible, performing variable 
selection and then looking at frequencies of triplets of vari¬ 
ables. These will be very difficult to work with in this 
sparse small data problem. 

We evaluate the models’ performance using the test-set per¬ 
plexity metric as before (table [2|. As can be seen, the 
frequency based approaches obtain worse results than the 
continuous latent space approaches. The frequency model 
with no smoothing obtains perplexity oo for split 2 because 
one of the test points has a value not observed before in 
the training set. Using smoothing solves this. The baseline 
(predicting uniformly) obtains the highest perplexity on av¬ 
erage. The linear model exhibits high variance for the last 
split, and in general has higher perplexity standard devia¬ 
tion than the non-linear model. 

5.2.2 Terror Warning Effects on Political Attitude 

We further compare the models on raw data collected in 
an experimental study of the effects of government terror 
warnings on political attitudes (from the START Terrorism 
Data Archive DataverseQ. This dataset consists of 1282 
cases and 49 variables. We used 1000 cases for training 
and 282 for test. From the 49 variables, we used the 17 cor¬ 
responding to categorical answers to study questions (the 
other variables are subjects’ geographic info and an answer 
to an open question). The 17 variables take 5 or 6 possible 
values for each variable with some of the values missing. 

We repeat the same experiment set-up as before, with a 6 

3 Obtained from the Harvard Dataverse Network 

thedata.harvard.edu/dvn/dv/start/faces/ 
study/StudyPage.xhtml?studyId=71190 


Baseline 

Multinomial 

LGM 

CLGP 

2.50 

2.20 

3.07 

2.11 


Table 3. Test-set perplexity for terror warning effects on polit¬ 
ical attitude. Compared are the linear LGM and the non-linear 
CLGP. 


dimensional latent dimensions, 100 inducing points, 5 sam¬ 
ples to estimate the lower bound, and running the optimiser 
for 1000 iterations. We compare a Baseline using a uni¬ 
form distribution for all values, a Multinomial model using 
the frequencies of the individual variables, the linear LGM, 
and the CLGP. The results are given in table [3] On the 
training set, the CLGP obtained a training error of 1.56, 
and the LGM obtained a training error of 1.34. The linear 
model seems to over-fit to the data. 


5.2.3 Handwritten Binary Alphadigits 


We evaluate the performance of our model on a sparse 
dataset with a large number of dimensions. The dataset. 
Binary Alphadigits, is composed of 20 x 16 binary images 
of 10 handwritten digits and 26 handwritten capital letters, 
each class with 39 image^] We resize each image to 10 x 8, 
and obtain a dataset of 1404 data points each with 80 binary 
variables. We repeat the same experiment set-up as before 
with 2 latent dimensions for ease of visual comparison of 
the latent embeddings. Fig. 4a shows an example of each 
alphadigit class. Each class is then randomly split to 30 
training and 9 test examples. In the test set, we randomly 
remove 20% of the pixels and evaluate the prediction error. 


Fig. [4b] shows the training and test error in log perplex¬ 
ity for both models. LGM converges faster than CLGP 
but ends up with a much higher prediction error due to its 
limited modeling capacity with 2 dimensional latent vari¬ 
ables. This is validated with the latent embedding shown 
in fig. [4c] and [4d] Each color-marker combination denotes 
one alphadigit class. As can be seen, CLGP has a better 
separation of the different classes than LGM even though 
the class labels are not used in the training. 

4 Obtained from http: / /www. cs.nyu. edu / -roweis/ 
data.html 
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(a) Example alphadigits (b) Train & Test log perplexity (c) LGM latent space (d) CLGP latent space 


Figure 4. Binary alphadigit dataset, (a) example of each class (b) Train and test log perplexity for 1000 iterations. (c,d) 2-D latend 
embedding of the 36 alphadigit classes with LGM (left) and CLGP model (right). Each color-marker combination denotes one class. 


5.3 Latent Gaussian Model Over-fitting 

It is interesting to note that the latent Gaussian model 
(LGM) over-fits on the different datasets. It is possible 
to contribute this to the lack of regularisation over the lin¬ 
ear transformation - the weight matrix used to transform 
the latent space to the Softmax weights is optimised with¬ 
out a prior. In all medical diagnosis experiment repetitions, 
for example, the model’s training log perplexity decreases 
while the test log perplexity starts increasing (see fig. [5]for 
the train and test log perplexities of split 1 of the breast 
cancer dataset). Note that even though the test log perplex¬ 
ity starts increasing, at its lowest point it is still higher than 
the end test log perplexity of the CLGP model. This is ob¬ 
served for all splits and all repetitions. 


phadigit dataset. It seems that the estimator standard devia¬ 
tion decreases as the approximating variational distribution 
fits to the posterior. This makes the stochastic optimisa¬ 
tion easier, speeding up convergence. A further theoretical 
study of this behaviour in future research will be of interest. 

6 Discussion and Conclusions 

We have presented the first Bayesian model capable of cap¬ 
turing sparse multi-modal categorical distributions based 
on a continuous representation. This model ties together 
many existing models in the field, linking the linear and 
discrete latent Gaussian models to the non-linear continu¬ 
ous space Gaussian process latent variable model and to 
the fully observed discrete Gaussian process classification. 


It is worth noting that we compared the CLGP to the LGM 
on the ASES survey dataset ( Inoguchi[ [2008 ) used to assess 
the LGM in (Khan et al. 2012| . We obtained a test perplex¬ 
ity of 1.98, compared to LGM’s test perplexity of 1.97. We 
concluded that the dataset was fairly linearly separable. 


5.4 Inference Robustness 


Lastly, we inspect the robustness of our inference, evaluat¬ 
ing the Monte Carlo (MC) estimate standard deviation as 
optimisation progresses. Fig. [6] shows the ELBO and MC 
standard deviation per iteration (on log scale) for the Al¬ 



ta) LGM (b) CLGP 


Figure 5. Train and test log perplexities for LGM (left) and the 
CLGP model (right) for one of the splits of the breast cancer 
dataset. The train log perplexity of LGM decreases while the test 
log perplexity starts increasing at iteration 50. 


In future work we aim to answer short-comings in the cur¬ 
rent model such as scalability and robustness. We scale the 


model following research on GP scalability done in (Hens- 


|man et ah] |2013[ |Gal et alT| |2014| >. The robustness of the 
model depends on the sample variance in the Monte Carlo 
integration. As discussed in Blei et al. ( |2Q12| >, variance 
reduction techniques can help in the estimation of the inte¬ 


gral, and methods such as the one developed in Wang et al. 


(2013j> can effectively increase inference robustness. 



Figure 6. ELBO and standard deviation per iteration (on log 

scale) for the Alphadigits dataset. 
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1 

2 

3 

4 

5 
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7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 


The basic model and inference (without covariance matrix 
caching) can be implemented in 20 lines of Python and 
Theano for each categorical variable: 

import theano.tensor as T 

m = T.dmatrix( 'm' ) # ..and other variables 

X = m + s * randn (N, Q) 

U = mu + L . dot (randn (M, K) ) 

Kmm = RBF (sf2, 1, Z) 

Kmn = RBF (sf2, 1, Z, X) 

Knn = RBFnn(sf2, 1, X) 

Kmmlnv = T.matrix_inverse(Kmm) 

A = Kmmlnv.dot(Kmn) 

B = Knn - T.sum(Kmn * Kmmlnv.dot(Kmn), 0) 

F = A.T.dot(U)+B[:, None] **0.5 * randn(N,K) 
S = T.nnet.softmax(F) 

KL_U, KL_X = get_KL_U () , get_KL_X () 

LS = T . sum (T . log (T . sum (Y * S, 1))) 

- KL_U - KL_X 

LS_func = theano.function (['''inputs'''] , 
LS) 

dLS_dm = theano.function (['''inputs' , 
T.grad(LS, m)) # and others 

# ... and optimise LS with RMS-PROP 




